[4/7] Telemetry Event Emission and Aggregation#327
Conversation
|
The emission format confirms to the telemetry proto, marked this ready for review. |
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing
- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure
Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing
- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils
87d1e85 to
32003e9
Compare
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure
This is part 4 of 7 in the telemetry implementation stack. Components: - TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter - MetricsAggregator: Per-statement aggregation with batch processing TelemetryEventEmitter: - Event-driven architecture using Node.js EventEmitter - Type-safe event emission methods - Respects telemetryEnabled configuration flag - All exceptions swallowed and logged at debug level - Zero impact when disabled Event Types: - connection.open: On successful connection - statement.start: On statement execution - statement.complete: On statement finish - cloudfetch.chunk: On chunk download - error: On exception with terminal classification MetricsAggregator: - Per-statement aggregation by statement_id - Connection events emitted immediately (no aggregation) - Statement events buffered until completeStatement() called - Terminal exceptions flushed immediately - Retryable exceptions buffered until statement complete - Batch size (default 100) triggers flush - Periodic timer (default 5s) triggers flush Batching Strategy: - Optimizes export efficiency - Reduces HTTP overhead - Smart flushing based on error criticality - Memory efficient with bounded buffers Testing: - 31 comprehensive unit tests for TelemetryEventEmitter - 32 comprehensive unit tests for MetricsAggregator - 100% function coverage, >90% line/branch coverage - Tests verify exception swallowing - Tests verify debug-only logging Dependencies: - Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management
Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing
- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure
This is part 4 of 7 in the telemetry implementation stack. Components: - TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter - MetricsAggregator: Per-statement aggregation with batch processing TelemetryEventEmitter: - Event-driven architecture using Node.js EventEmitter - Type-safe event emission methods - Respects telemetryEnabled configuration flag - All exceptions swallowed and logged at debug level - Zero impact when disabled Event Types: - connection.open: On successful connection - statement.start: On statement execution - statement.complete: On statement finish - cloudfetch.chunk: On chunk download - error: On exception with terminal classification MetricsAggregator: - Per-statement aggregation by statement_id - Connection events emitted immediately (no aggregation) - Statement events buffered until completeStatement() called - Terminal exceptions flushed immediately - Retryable exceptions buffered until statement complete - Batch size (default 100) triggers flush - Periodic timer (default 5s) triggers flush Batching Strategy: - Optimizes export efficiency - Reduces HTTP overhead - Smart flushing based on error criticality - Memory efficient with bounded buffers Testing: - 31 comprehensive unit tests for TelemetryEventEmitter - 32 comprehensive unit tests for MetricsAggregator - 100% function coverage, >90% line/branch coverage - Tests verify exception swallowing - Tests verify debug-only logging Dependencies: - Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management
This is part 5 of 7 in the telemetry implementation stack. Components: - DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker - TelemetryExporterStub: Test stub for integration tests DatabricksTelemetryExporter: - Exports telemetry metrics to Databricks via HTTP POST - Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth) - Integrates with CircuitBreaker for per-host endpoint protection - Retry logic with exponential backoff and jitter - Exception classification (terminal vs retryable) Export Flow: 1. Check circuit breaker state (skip if OPEN) 2. Execute with circuit breaker protection 3. Retry on retryable errors with backoff 4. Circuit breaker tracks success/failure 5. All exceptions swallowed and logged at debug level Retry Strategy: - Max retries: 3 (default, configurable) - Exponential backoff: 100ms * 2^attempt - Jitter: Random 0-100ms to prevent thundering herd - Terminal errors: No retry (401, 403, 404, 400) - Retryable errors: Retry with backoff (429, 500, 502, 503, 504) Circuit Breaker Integration: - Success → Record success with circuit breaker - Failure → Record failure with circuit breaker - Circuit OPEN → Skip export, log at debug - Automatic recovery via HALF_OPEN state Critical Requirements: - All exceptions swallowed (NEVER throws) - All logging at LogLevel.debug ONLY - No console logging - Driver continues when telemetry fails Testing: - 24 comprehensive unit tests - 96% statement coverage, 84% branch coverage - Tests verify exception swallowing - Tests verify retry logic - Tests verify circuit breaker integration - TelemetryExporterStub for integration tests Dependencies: - Builds on all previous layers [1/7] through [4/7]
| telemetryMaxRetries: 3, | ||
| telemetryAuthenticatedExport: true, | ||
| telemetryCircuitBreakerThreshold: 5, | ||
| telemetryCircuitBreakerTimeout: 60000, // 1 minute |
There was a problem hiding this comment.
[F9 — High] Default-config drift: 7 keys here vs 15 in DEFAULT_TELEMETRY_CONFIG
getDefaultConfig (this method, lines 128-134) defines 7 telemetry keys. DEFAULT_TELEMETRY_CONFIG in lib/telemetry/types.ts:91-103 defines 15. Components like MetricsAggregator fall back through ?? DEFAULT_TELEMETRY_CONFIG.X for the missing keys. Today the values agree by hand; they will silently desync the next time someone changes one source but not the other.
Fix: spread DEFAULT_TELEMETRY_CONFIG here (with a typed mapper for the telemetry-prefixed ClientConfig keys), or stop maintaining the inline copy and route every component through the single frozen const.
Posted by Code Review Squad.
There was a problem hiding this comment.
Done in 5669c30 — getDefaultConfig() now sources all 15 telemetry keys from DEFAULT_TELEMETRY_CONFIG (with the unprefixed → telemetry-prefixed key mapping done once). Single source of truth; adding a new knob now requires one change to the frozen const and the public option surface.
| for (const k of telemetryOverrides) { | ||
| if (options[k] !== undefined) { | ||
| // The narrow union forces a cast; values are validated at point of use. | ||
| (this.config as any)[k] = options[k]; |
There was a problem hiding this comment.
[F11 — Medium] as any cast removes type safety on every telemetry option override
The telemetryOverrides array above is as const so the keys are a literal union. Both ConnectionOptions and ClientConfig have identical key names and types for the telemetry knobs. The cast escapes the structural type system for no reason. A user passing telemetryBatchSize: "100" (string) silently writes a string into a number field; MetricsAggregator then reads it as number and aggregation thresholds break at runtime.
Fix: per-key narrowed assignments, or a small typed pickDefined<T, K>(dst: T, src: Partial<T>, keys: readonly K[]) helper. Same readability, no cast.
Posted by Code Review Squad.
There was a problem hiding this comment.
Done in 5669c30 — replaced the as any cast with a typed copyDefinedTelemetryOptions(src: ConnectionOptions, dst: ClientConfig) helper that performs per-key narrowed assignments. A wrong-shape value (telemetryBatchSize: "100") now fails the TS check at the user call site instead of silently writing a string into a number field.
| /** @internal */ | ||
| public getTelemetryAggregator() { | ||
| return this.telemetryClient?.getAggregator(); | ||
| } |
There was a problem hiding this comment.
[F18 — Medium] getTelemetryAggregator() lacks explicit return-type annotation
Three reviewers flagged this. The return type is inferred as MetricsAggregator | undefined (from this.telemetryClient?.getAggregator()), but MetricsAggregator is intentionally not re-exported from lib/index.ts. Consumers calling client.getTelemetryAggregator() see a method whose return type references a non-exported symbol — autocomplete degrades, agent-completion gets confused, and the contract diverges from IClientContext.getTelemetryAggregator?(): MetricsAggregator | undefined even though the line two above (getTelemetryEmitter()) does annotate explicitly.
Fix: add : MetricsAggregator | undefined to match IClientContext. Since the method is @internal, an even cleaner option is to remove it from the public class entirely (move to a friend interface used only within the package) — TS-public methods marked only with a JSDoc @internal are the worst of both worlds for autocomplete.
Posted by Code Review Squad.
There was a problem hiding this comment.
Done in 5669c30 — explicit getTelemetryAggregator(): MetricsAggregator | undefined annotation matching IClientContext. Also added a public companion getTelemetryStats(): {pendingMetricsCount, inFlightStatements, droppedMetrics, evictedStatements, circuitBreakerState, ...} | undefined (F13) so operators can read telemetry health without grepping debug logs.
| // charge — avoiding the footgun where a sysadmin sets the var to "false" | ||
| // expecting to enable telemetry. | ||
| const envKill = process.env.DATABRICKS_TELEMETRY_DISABLED; | ||
| const envDisabled = typeof envKill === 'string' && /^(1|true|yes|on)$/i.test(envKill.trim()); |
There was a problem hiding this comment.
[F22 — Medium] DATABRICKS_TELEMETRY_DISABLED=false keeps telemetry on — silently ignored
The regex /^(1|true|yes|on)$/i is the deliberate "footgun-avoidance" choice (so false/0/no don't accidentally disable opt-out), and the comment above explains it. The remaining UX problem: an ops engineer who sees the env var and tries to "set it to false to keep telemetry on" gets the opposite of what they expect — false is silently ignored, leaving runtime config (default true) in charge.
Fix: log a LogLevel.warn line when the env var is set to a non-recognized non-empty value (false, 0, no, off, etc.) so ops can see their disable didn't take effect. That preserves the safe-default behavior while surfacing the misconfiguration.
Posted by Code Review Squad.
There was a problem hiding this comment.
Done in 5669c30 — when DATABRICKS_TELEMETRY_DISABLED is set to a non-empty value that isn't in the recognized truthy set (1/true/yes/on, case-insensitive), the driver logs a LogLevel.warn line stating the value was ignored and listing the recognized forms. The safe-default behavior is preserved (no accidental opt-in), but ops can now see the disable-failed shape in logs. Unit tests cover all the unrecognized values (0/false/no/off/False/OFF) and the recognized ones.
| if (arrowBatches) { | ||
| for (const batch of arrowBatches) { | ||
| bytes += batch.batch?.length ?? 0; | ||
| } |
There was a problem hiding this comment.
[F10 — Medium] RowSetProvider.emitChunkEvent undercounts bytes for non-arrow paths
bytes is computed only from response.results?.arrowBatches. A TFetchResultsResp carrying inline JSON / column-based data (COLUMN_BASED_SET) reports bytes: 0. URL-based result sets report bytes via CloudFetchResultHandler (a separate emit site), so URL-based is fine. But column/json paths flow through here and the metric is then aggregated into bytesDownloaded in MetricsAggregator.processStatementEvent (details.bytesDownloaded += event.bytes ?? 0).
Net effect: dashboards that aggregate bytesDownloaded will be silently undercounted whenever the result is inline JSON / column-based.
Fix: compute and account inline-result bytes here for non-arrow paths, or document explicitly that bytesDownloaded from this emit site counts only arrow-batch payloads (so dashboards know to add the CloudFetch-emitted byte counts separately).
Posted by Code Review Squad.
There was a problem hiding this comment.
Done in 5669c30 — emitChunkEvent now sums bytes across every TFetchResultsResp shape: arrowBatches[i].batch (existing), binaryColumns, and per-column values via a new estimateColumnBytes helper that handles all 8 TColumn value types plus their nulls buffers. URL-based result sets continue to emit from CloudFetchResultHandler.emitCloudFetchChunk with the post-download byte count. New tests cover both COLUMN_BASED_SET and binary-column paths.
| safeEmit(this.context, (emitter) => { | ||
| emitter.emitError({ | ||
| statementId: this.id, | ||
| sessionId: this.sessionId, |
There was a problem hiding this comment.
[F12 — Medium] sessionId default mismatch — '' here, undefined in emitErrorEvent
emitStatementStart (line 549) and emitStatementComplete (line 575) emit sessionId: this.sessionId || ''. emitErrorEvent (line 596) emits sessionId: this.sessionId (no || ''). The aggregator's per-statement state groups by sessionId, so an operation with no sessionId emits start/complete keyed under empty string and errors keyed under undefined — split into two synthetic sessions in the aggregator and exporter.
Fix: pick one default and apply consistently. Prefer undefined and let the aggregator filter at the processEvent boundary (cleaner than synthesizing fake empty-string sessions).
Posted by Code Review Squad.
There was a problem hiding this comment.
Done in 5669c30 — emitStatementStart/emitStatementComplete now pass this.sessionId directly (no || '' coercion), aligned with emitErrorEvent. Emitter signatures and StatementTelemetryDetails updated to accept sessionId?: string, so an operation that briefly lacks a sessionId no longer splits into two synthetic buckets in the aggregator.
Code Review Squad ReportScope: 29 files, +2970/-424 lines Executive SummaryThis is a sizable, mostly well-engineered telemetry layer that ships two verified data-integrity bugs that mis-stamp every connection-close metric and every workspace-attribution field on AWS+Azure customers, plus multi-tenant safety issues (FIFO auth-provider sharing, silent override of opt-out config) that materially undermine the privacy posture of a default-on feature. Tests for the +314-line None of these are unrecoverable — most have small, focused fixes — but they should land before this merges and ships to customers as part of the 1.13 default-on telemetry rollout. Critical & High findings (inline comments above)
High findings without a specific anchorF6 — Default-on telemetry ships customer-controlled F7 — F8 — Telemetry HTTP retry double-stacks with user's 15-min Medium / Low findings (no inline comment, summarized here)
Verified non-issues (confirmed against PR head; flagged so reviewers don't re-raise)
Feedback? Drop it in #code-review-squad-feedback. |
Critical/High: - F1 close-latency now measures closeSession RPC duration, not session lifetime - F2 extractWorkspaceId reads ?o=N from httpPath; returns undefined on miss (host-based extraction shipped wrong values on AWS/Azure) - F3 redact errorStack/errorMessage at emit time so in-process EventEmitter listeners see the same redacted strings the export pipeline does - F4 README "Multi-tenant SaaS deployments" section + warn on diverging auth providers across registrants on the same host - F5 cache auth-provider snapshot at register time; unregister uses cached reference so rotated providers don't leak in the FIFO - F6 redactSensitive(userAgentEntry) before UA construction in exporter - F8 telemetry retry no longer stacks on customer retriesTimeout - F9 getDefaultConfig sources all 15 telemetry keys from DEFAULT_TELEMETRY_CONFIG Mediums: - F10 RowSetProvider.emitChunkEvent counts bytes for column-based, binary-column, and arrow shapes (was arrow-only) - F11 copyDefinedTelemetryOptions helper replaces `as any` cast on overrides - F12 sessionId default consistently undefined across emit sites - F17 prune throwing contexts in TelemetryClient fall-through helpers - F18 explicit MetricsAggregator|undefined return annotation - F19 redaction patterns broadened: /var/lib, /opt, /srv, /app, UNC, macOS /var/folders - F20 e2e telemetry test gets skip-guard for missing creds + tightened assertions (counts, no disjunction-over-spies) - F21 errorMessage redacted at emit time (not only errorStack) - F22 warn on unrecognized DATABRICKS_TELEMETRY_DISABLED values Low / cleanup: - F13 getTelemetryStats() API + warn-level capacity-event summary - F14 typed on/off/once overloads on TelemetryEventEmitter - F15 ship driverConfig only on first openSession per client - F16 typed safeEmit (no more `any` for emitter/context) - F23 superseded by F2 (workspaceId is numeric now) - F24 README defaults section stays abstract — keys reference const - F26 chunkSumLatencyMs aligned with chunkCount presence - F27 compressionEnabled is sticky-OR across chunks, not last-wins - F28 drop redundant Promise.resolve(this.flush(...)) wraps - F29 optional telemetryFlushOnExit beforeExit hook, detached in close - F30 resetInstance renamed __resetInstanceForTests - F31 noted ITelemetrySink follow-up on IClientContext - F32 telemetryMaxPendingMetrics exposed on ConnectionOptions - F36 emitWrapped helper consolidates the per-method scaffold Tests: 26 new DBSQLClient unit tests (env-kill, extractWorkspaceId, refcount on reconnect/init-failure, F15 dedup, getTelemetryStats). 2 new RowSetProvider tests for non-arrow byte counting. e2e test rewritten with skip-guard + tightened assertions. Existing tests updated for new log-message format and snapshot-pair contexts. Co-authored-by: Isaac
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
|
Thanks for the thorough review — Critical & High — addressed
Medium / Low — addressed
Test results
Note: this PR depends on PR #326 (Telemetry Client Management) — the multi-context auth-provider refactor is built on top of |
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
- Re-apply prettier formatting on lib/DBSQLClient.ts, lib/result/CloudFetchResultHandler.ts, lib/result/RowSetProvider.ts (eslint --fix in the prior commit re-introduced quote/destructuring formatting that prettier rewrote back). - Revert the loosening of the "share telemetry client across multiple connections" e2e assertions back to exact-count form now that the PECO test workspace's telemetry feature flag is on. Strict count catches refcount regressions an at.least() assertion would hide. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
…sert CI on the previous commit showed two e2e failures driven by state that leaks across e2e test files, not by real defects: 1. `should initialize telemetry when telemetryEnabled is true` — `featureFlagCacheSpy.called` was false because a prior e2e suite had already created a TelemetryClient for the test host. Subsequent `getOrCreateClient` calls hit the existing holder and skip the constructor (and the FFCache.getOrCreateContext call inside it). Fixed by `before` + `beforeEach` `__resetInstanceForTests()` so this suite always sees a fresh provider regardless of execution order. That also clears the "second context, different auth" F4 warn that was firing on every test in this suite. 2. `should cleanup telemetry on close` — `MetricsAggregator.flush()` isn't called by the close-drain when `pendingMetrics` is empty (the test connects + closes without ever `openSession`-ing, so no events are queued). Replaced the strict `flushSpy.called` assert with `aggregatorCloseSpy.called` — `MetricsAggregator.close()` is the cleanup surface that always runs, regardless of buffer state. Also: change the outer `describe` from `function ()` to arrow form (eslint `prefer-arrow-callback`); only `before`/`it` need the function-form for `this`. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
Rebases the docs PR onto main now that #327 has landed. Drops all code changes from the diff; keeps only the new docs (docs/TELEMETRY.md, spec/telemetry-design.md, spec/telemetry-test-completion-summary.md) and swaps README's telemetry section to a short pointer to docs/TELEMETRY.md. Co-authored-by: Isaac
* docs: add telemetry design, test summary, and user guide Rebases the docs PR onto main now that #327 has landed. Drops all code changes from the diff; keeps only the new docs (docs/TELEMETRY.md, spec/telemetry-design.md, spec/telemetry-test-completion-summary.md) and swaps README's telemetry section to a short pointer to docs/TELEMETRY.md. Co-authored-by: Isaac * docs: shorten README telemetry section Replace the verbose bulleted overview with a tight three-sentence blurb that names what's collected, where the opt-outs live, and points readers to docs/TELEMETRY.md for everything else. Co-authored-by: Isaac * docs: rewrite telemetry guides to be terse and accurate - docs/TELEMETRY.md: 717 → 123 lines. Drop duplicate privacy sections, long examples, architecture deep-dive. Fix the inaccurate "disabled by default" claim. Cross-reference IDBSQLClient.ts JSDoc for defaults so the doc can't drift. - spec/telemetry-design.md: 2538 → 125 lines. Drop the implementation checklist, exhaustive proto field table, and inline class bodies that mirrored lib/telemetry/*.ts. Keep architecture diagram, per-component responsibilities, export lifecycle, privacy/error/shutdown invariants. - spec/telemetry-test-completion-summary.md: 648 → 47 lines. Collapse per-component prose into a single table, drop verbose exit-criteria-verification and quality-metrics sections. Co-authored-by: Isaac * Delete spec/telemetry-test-completion-summary.md * docs: prettier-format docs/TELEMETRY.md Align the configuration-options table columns to satisfy prettier's markdown table formatter. Pure formatting; no content change. Co-authored-by: Isaac
Part 4 of 7-part Telemetry Implementation Stack
This layer adds event-driven telemetry emission, per-host shared
aggregation/export, and operation-level error telemetry on top of the
hardened infrastructure shipped in [1/7]–[3/7].
What's in this PR
TelemetryEventEmitter (
lib/telemetry/TelemetryEventEmitter.ts)Per-
DBSQLClientevent emitter — typed emission methods, respects theclient's
telemetryEnabledflag, swallows all exceptions at debug level.Each emitter bridges into the shared aggregator on the per-host
TelemetryClient.Event types:
CONNECTION_OPEN,CONNECTION_CLOSE(new),STATEMENT_START,STATEMENT_COMPLETE,CLOUDFETCH_CHUNK,ERROR.TelemetryClient owns the per-host triad (
lib/telemetry/TelemetryClient.ts)TelemetryClientProvideris a process-wide singleton. Each host gets oneTelemetryClientthat owns:DatabricksTelemetryExporterMetricsAggregatorCircuitBreakerRegistryFeatureFlagCacheMultiple
DBSQLClientinstances on the same host share these — breakercounters and HTTP batches don't fragment per-instance.
TelemetryClientimplements
IClientContextso the owned components have a stable contextthat survives any single
DBSQLClient's close. Connection/auth providersare tracked in a FIFO of registered contexts; the exporter falls through
to the next active one when the head closes.
MetricsAggregator (per-host, on
TelemetryClient)Restored from
main's hardened version, with new functionality layered on:DELETE_SESSIONconnection metric.chunkInitialLatencyMs,chunkSlowestLatencyMs,chunkSumLatencyMsaccumulated fromCLOUDFETCH_CHUNK events with positive latency.
maxPendingMetrics(drop preferring non-error to keepfirst-failure signal),
maxErrorsPerStatement,statementTtlMseviction.unref()'d), terminal-errorimmediate flush, manual
flush().close()is async and awaits the final HTTP POST soawait client.close(); process.exit(0)doesn't truncate the last batch.Error telemetry wired into operation entry points
DBSQLOperationnow emitsERRORevents (withExceptionClassifierterminal/retryable classification) from
fetchChunk,cancel,close,and
getMetadata. Failed queries produce aSTATEMENT_COMPLETEplus anERRORproto witherror_info: { error_name, stack_trace }(stack runthrough
redactSensitive).emitStatementCompleteno longer issues agetMetadataThrift RPC onclose (perf regression + spurious-error-telemetry trap).
Type-safe wiring (
IClientContext)Added optional
getTelemetryEmitter()/getTelemetryAggregator()toIClientContext. Removed all(this.context as any)casts at the sevenemit call sites (
DBSQLOperation,DBSQLSession,RowSetProvider,CloudFetchResultHandler).The six copy-pasted listeners in
DBSQLClient.initializeTelemetryare nowone bridge loop over
Object.values(TelemetryEventType)— closes thelistener-name mismatch that originally caused error events to be silently
dropped.
mapAuthTypecovers all sixauthTypevalues (access-token,databricks-oauth{U2M/M2M},
custom,token-provider,external-token,static-token).Verified end-to-end against an Azure Databricks workspace
Healthy
SELECT 1produces 3 wire metrics:CREATE_SESSION(withsystem_configuration),STATEMENT_COMPLETE(withsql_operation.execution_result),DELETE_SESSION.Failed query produces 4:
CREATE_SESSION,STATEMENT_COMPLETE(latencyonly),
ERROR(with redacted stack),DELETE_SESSION.Server-side feature flag is still the kill switch —
telemetryEnabled: falseon the client also skips the entire pipeline (no acquire/release noise).
Testing
484 unit tests pass (telemetry + DBSQLClient/Operation/Session/result).
Test files for the rebased modules are restored from
main. Provider testsupdated for the singleton API.
Coverage gap acknowledged: no unit tests yet for the re-applied
chunk-timing aggregation or
CONNECTION_CLOSEhandling. Adding these as afollow-up to the same epic.
Dependencies
Depends on:
Out of scope (open follow-ups)
operation_typeproto field soCREATE_SESSIONandDELETE_SESSIONare explicitly distinguished on the wire (today they're distinguishable
only by the incidental presence of
system_configuration).getStats()for telemetry-pipeline self-observability (drop counts,queue depth, last-success timestamp).
ConnectionOptions(currently3 of 13).
DBSQLClient,close from
DBSQLSession).CONNECTION_CLOSE.